Data about hybrid open access uptake is critical. Although policy recommendations have addressed open and transparent workflows in recent years, identifying hybrid open access articles and their funding sources remains challenging. In this post, we show how to mine such data from Elsevier. Between 2015 and July 2019, Elsevier’s subscription journals published 63,577 hybrid open access articles, representing 2.3% of the overall publication volume of these journals. A data analysis reveals a growing uptake of agreements between Elsevier and funders to cover the costs of open access. Not surprisingly, mostly British and Dutch funders sponsor hybrid open access. The German Federal Ministry of Education and Research is also well represented, despite the current Elsevier boycott by most universities and research organizations in Germany. Nevertheless, the majority of funding sources are still unknown, raising important questions about the transparency of this publishing model.
In September 2018, cOAlition S, a group of international research funders including the European Commission, announced its widely discussed Plan S. According to its principles, publication fees that may arise when publishing open access should be covered by funders or research organizations. Although surveys (Solomon and Björk 2011; Dallmeier-Tiessen et al. 2011) suggest that most authors do not pay such fees out of their own pockets, publishers rarely share such evidence. Nor do all funding organizations and research institutions disclose open access sponsorship at the article level, despite existing attempts to crowd-source such data (Jahn and Tullney 2016). Prominent examples of openly shared details about publication fee spending are the British Charity Open Access Fund, maintained by the Wellcome Trust, and the Open APC Initiative.
This blog post presents a dataset comprising publicly available spending information from Elsevier, a major publisher. The dataset, which was retrieved from Crossref and open access full-texts, serves as critical input to ongoing discussions around transitioning subscription journals to open access. The methods used to obtain the data address key challenges in discovering hybrid open access articles along with funding and affiliation information. Elsevier’s effort to make such data openly available also provides a good-practice example for the ongoing development of metadata standards and workflow recommendations that address transformative agreements between publishers and libraries, a discussion led by the ESAC initiative.
As more specific examples, the dataset is used to analyze the growth of hybrid open access among Elsevier journals, and to contrast these figures with the overall publication volume of its subscription-based journals. Drawing on Elsevier’s funding information, the post furthermore investigates whether publication fees were billed to the authors, covered by a funder as part of an agreement with Elsevier, or waived. Moreover, text-mined author email domains are presented as a rough approximation of the affiliation of the first or corresponding author, an important data point for delineating open access funding.
The resulting dataset is openly available on GitHub along with the source code.
Methods follow the Hybrid Open Access Journal Dashboard, an interactive analytical application from the SUB Göttingen to monitor the longitudinal development of this publishing model at large scale. Instead of using spending data from the Open APC Initiative, the Elsevier publication fee price list, shared as a PDF document, was used to obtain hybrid open access journals. The rOpenSci tabulizer package made it possible to extract the data from this file.
Next, the Crossref REST API was queried to discover open access articles published in these journals, as well as to retrieve yearly article volumes for the period 2015 to 2019. Using the rcrossref client, developed and maintained by the rOpenSci initiative, the first API call retrieved all license URLs available per journal. I also drew on facet field counts to obtain the yearly article volume per journal from Crossref. After matching license URLs indicating open access articles, a second API call checked licensing metadata per journal. Here, the Crossref REST API filters license.url and license.delay made it possible to exclude delayed open access articles. For every immediate open access article, comprehensive Crossref metadata was obtained, including links to full-texts.
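The two-step querying described above can be sketched roughly as follows. This is not the code used for the dataset, which relied on rcrossref; it only builds the request URLs in base R so that the filters and facets become visible. The helper name `crossref_works_url()` and the ISSN are made up for illustration, and the exact date range and license URL are assumptions.

```r
# Sketch only: builds Crossref REST API request URLs without sending them.
# crossref_works_url() is a hypothetical helper, not part of rcrossref.
crossref_works_url <- function(issn, filters = character(), facet = NULL, rows = 0) {
  # Crossref expects filters as comma-separated name:value pairs
  all_filters <- c(issn = issn, filters)
  filter_string <- paste0(names(all_filters), ":", all_filters, collapse = ",")
  url <- paste0(
    "https://api.crossref.org/works?filter=",
    utils::URLencode(filter_string),
    "&rows=", rows
  )
  # Facets return counts per field value, e.g. per license URL or per year
  if (!is.null(facet)) {
    url <- paste0(url, "&facet=", facet)
  }
  url
}

# Step 1: which license URLs does a journal use? (placeholder ISSN)
license_query <- crossref_works_url(
  issn = "0000-0000",
  filters = c("from-pub-date" = "2015-01-01"),
  facet = "license:*"
)

# Step 2: articles under an open license that applied without delay
oa_query <- crossref_works_url(
  issn = "0000-0000",
  filters = c(
    "from-pub-date" = "2015-01-01",
    "license.url" = "http://creativecommons.org/licenses/by/4.0/",
    "license.delay" = "0"
  ),
  rows = 20
)

print(license_query)
print(oa_query)
```

Replacing the facet in the first query with `published:*` would return yearly counts instead of license counts, which is one way to obtain the article volume per journal.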
Elsevier provides access to full-texts as HTML and XML documents via the Crossref Text and Data Mining Services (Crossref-TDM). Surprisingly, the XML representation contains not only the full-text but also comprehensive metadata, including the information about open access sponsorship shown below.
<openaccess>1</openaccess>
<openaccessArticle>true</openaccessArticle>
<openaccessType>Full</openaccessType>
<openArchiveArticle>false</openArchiveArticle>
<openaccessSponsorName>
Arts and Humanities Research Council
</openaccessSponsorName>
<openaccessSponsorType>FundingBody</openaccessSponsorType>
<openaccessUserLicense>
http://creativecommons.org/licenses/by/4.0/
</openaccessUserLicense>
Snapshot of open access metadata in an Elsevier XML full-text. https://api.elsevier.com/content/article/pii/S1475158518302261
After interfacing with the Elsevier full-texts using the crminer package, a client maintained by rOpenSci, this open access information was extracted from the XML-based full-text.
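To illustrate the extraction step, here is a minimal base R sketch that pulls the open access fields out of the snippet shown above. It uses regular expressions only to stay dependency-free; a proper XML parser would be the more robust choice for real full-texts, where these elements may be nested or namespaced. The helper name `extract_tag()` is made up for this example.

```r
# The open access metadata from the snapshot above, as one string
xml_snippet <- paste(
  "<openaccess>1</openaccess>",
  "<openaccessArticle>true</openaccessArticle>",
  "<openaccessType>Full</openaccessType>",
  "<openArchiveArticle>false</openArchiveArticle>",
  "<openaccessSponsorName>",
  "Arts and Humanities Research Council",
  "</openaccessSponsorName>",
  "<openaccessSponsorType>FundingBody</openaccessSponsorType>",
  "<openaccessUserLicense>",
  "http://creativecommons.org/licenses/by/4.0/",
  "</openaccessUserLicense>",
  sep = "\n"
)

# Hypothetical helper: return the trimmed text of the first <tag>...</tag>
extract_tag <- function(xml, tag) {
  pattern <- paste0("<", tag, ">\\s*(.*?)\\s*</", tag, ">")
  hit <- regmatches(xml, regexec(pattern, xml, perl = TRUE))[[1]]
  if (length(hit) < 2) NA_character_ else hit[2]
}

tags <- c("openaccessSponsorName", "openaccessSponsorType", "openaccessUserLicense")
oa_info <- vapply(tags, function(tag) extract_tag(xml_snippet, tag), character(1))
print(oa_info)
```

Tags that are missing from a document yield `NA`, which is convenient when a full-text carries no sponsor information.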
Moreover, the first author email address was parsed using pattern matching, assuming that email domains roughly indicate the affiliation of the first or corresponding author at the time of publication. Next, the email domains were split into their parts with urltools.
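The pattern matching can be sketched in base R as shown below. The regular expression, the helper names and the example address are assumptions made for illustration, not the ones used to build the dataset. The naive split on dots also differs from what urltools offers: a function such as its `suffix_extract()` knows about public suffixes like `ac.uk`, which a plain split does not.

```r
# Hypothetical helper: find the first email address in a text and return its domain
extract_email_domain <- function(text) {
  email_pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"
  hit <- regmatches(text, regexpr(email_pattern, text))
  if (length(hit) == 0) {
    return(NA_character_)
  }
  # Drop everything up to and including the @ sign
  tolower(sub("^.*@", "", hit[1]))
}

# Naive split into labels; the last label is the top-level domain
split_domain <- function(domain) {
  parts <- strsplit(domain, ".", fixed = TRUE)[[1]]
  list(top_level = parts[length(parts)], labels = parts)
}

# Made-up author note, not taken from an Elsevier full-text
author_note <- "Corresponding author. E-mail address: jane.doe@uni-example.ac.uk (J. Doe)."
email_domain <- extract_email_domain(author_note)
domain_parts <- split_domain(email_domain)

print(email_domain)
print(domain_parts$top_level)
```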
The resulting dataset comprises the following variables, and is openly shared via GitHub.
First ten rows
library(rmarkdown)
hybrid_df <- readr::read_csv("data/els_hybrid_info_normalized.csv")
paged_table(head(hybrid_df, 10))
It must be noted, however, that the open access information in Elsevier full-texts was not documented at the time of writing this blog post.
In total, the dataset comprises 63,577 hybrid open access articles from 1,703 hybrid open access journals published between January 2015 and July 2019.
Using this dataset, the share of hybrid open access articles per journal was calculated. To explore variations among journals, Bob Rudis’ ggeconodist package was used. The package does a great job of replicating the boxplot aesthetics of The Economist magazine.
The figure shows a slow but steady hybrid open access uptake. The median open access proportion was around 3% in the first seven months of 2019. Of the 1,985 Elsevier subscription journals offering hybrid open access, 1,703 did in fact publish at least one article under this model, corresponding to a share of 86%.
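As a quick check of the reported share, and to make the per-journal calculation concrete, here is a small sketch. The journal counts are the ones given above; the three journals in the toy table are invented and only show how an open access share per journal and its median are derived.

```r
# Journals that offer hybrid open access vs. those that actually used it
journals_offering_hybrid <- 1985
journals_with_hybrid_articles <- 1703
journal_share <- journals_with_hybrid_articles / journals_offering_hybrid
print(round(100 * journal_share))

# Invented toy data: open access articles and total articles per journal
toy <- data.frame(
  journal = c("A", "B", "C"),
  oa_articles = c(3, 0, 12),
  all_articles = c(100, 80, 150)
)
toy$oa_share <- toy$oa_articles / toy$all_articles
median_share <- median(toy$oa_share)
print(median_share)
```

Journals without any open access article stay in the table with a share of zero, so they pull the median down rather than being silently dropped.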
Elsevier usually requires authors to pay a publication fee, also known as an article processing charge (APC), to publish open access. Many authors make use of funding from grant agencies or academic institutions to cover such fees. To streamline this process, some funding bodies and institutions have agreed on central payment options for affiliated researchers. Elsevier also provides APC waivers.
In most cases (59%), payment notifications were sent to the authors, who paid the fee directly. Elsevier lists a funding body covering the open access publication fee for around one third of articles.
The following interactive visualization lets you browse the funders as disclosed by Elsevier.
Mostly British and Dutch funders sponsored hybrid open access in Elsevier journals. The German Federal Ministry of Education and Research (BMBF) is also well represented, despite the current boycott by most universities and research organizations in Germany. According to the publisher, the BMBF has financially supported 152 hybrid open access articles in 110 Elsevier journals since 2018.
In addition to funding information, email domains were parsed from Elsevier full-texts. These domains roughly indicate the affiliation of the first or corresponding author, a data point used to delineate open access funding. In the following, a hierarchical, interactive treemap visualizes the distribution of the email domains. Each top-level domain can be subdivided further into domain names representing academic institutions or companies. The size of each rectangle is proportional to the number of hybrid open access articles corresponding to this domain.
Dallmeier-Tiessen, Suenje, Robert Darby, Bettina Goerner, Jenni Hyppoelae, Peter Igo-Kemenes, Deborah Kahn, Simon C. Lambert, et al. 2011. “Highlights from the SOAP Project Survey. What Scientists Think About Open Access Publishing.” http://arxiv.org/abs/1101.5260.
Jahn, Najko, and Marco Tullney. 2016. “A Study of Institutional Spending on Open Access Publication Fees in Germany.” PeerJ 4 (August). PeerJ: e2323. https://doi.org/10.7717/peerj.2323.
Solomon, David J., and Bo-Christer Björk. 2011. “Publication Fees in Open Access Publishing: Sources of Funding and Factors Influencing Choice of Journal.” Journal of the Association for Information Science and Technology 63 (1). Wiley-Blackwell: 98–107. https://doi.org/10.1002/asi.21660.